\documentclass{elsarticle}
\usepackage[colorinlistoftodos, textwidth=4cm]{todonotes}
\usepackage{amsmath}
\usepackage{microtype}
\journal{International Journal of Human-Computer Studies}

\begin{document}

\begin{frontmatter}
  \title{Social network analysis of Chinese Wikipedia\tnoteref{t1}}
  \tnotetext[t1]{This research is supported by the National Natural
    Science Foundation of China under Grant No.~70871006.}

  \author[buaa]{Yunpeng Wu\corref{cor1}}
  \ead{yunpeng.wu@sem.buaa.edu.cn}
  \author[buaa]{Jun Wang}
  \ead{king.wang@buaa.edu.cn}
  \author[buaa]{Lu Liu}
  \ead{liulu@buaa.edu.cn}
  
  \cortext[cor1]{Corresponding author}

  \address[buaa]{School of Economics \& Management, Beihang University, 
    Beijing 100083, P.R. China}
  
  \begin{abstract}
    
  \end{abstract}

  \begin{keyword}
    Social network analysis \sep Chinese Wikipedia
  \end{keyword}

\end{frontmatter}

\listoftodos
\section{Introduction}
\label{sec:introduction}
\todo[color=yellow!40]{Abstract needed} %
Wikipedia has become the world's largest online
encyclopedia. Millions of people work together to add a neutral account
of all human knowledge\todo{reference needed} to it,
without even knowing each other. Wikipedia has increasingly become a
subject of scientific study, both qualitative and quantitative [15],
but most scholars treat it merely as a complex network formed by
its pages, so the properties and topology of that network are the most
common research topics. The most interesting question, however, is how
these different people organize themselves and collaborate: to what
extent do participants get involved in this activity, and what
categories do they fall into? Although social network analysis has been
developed over several decades and applied to research ranging from SNS
to e-commerce, few articles concentrate on the social organization of
Wikipedia.

One of the most interesting questions concerning Wikipedia is who
wrote its huge body of content. According to Wikipedia statistics,
Chinese Wikipedia now has 176 thousand articles~\cite{wikistat}.
Although the number of registered users of Chinese Wikipedia has
reached 12 thousand, not all of them contribute content, and
many anonymous contributors have also written some of the
articles. It has been pointed out that a small group of people within
the community contributes the greater part of Wikipedia's content,
which suggests that the 80-20 rule also applies to Wikipedia content
creation~\cite{aswartz}. Though the claim has been questioned, no
academic research has yet investigated the issue to determine how
millions of articles are created, who wrote them, and what proportion
of writers are `active'. This study tries to address these issues.

\section{Chinese Wikipedia}
\label{sec:introduction-1}
Wikimedia provides archives of Wikipedia content, dumped
regularly from Wikipedia's website. There are three different types of
archive. The first, pages\nobreakdash{-}articles, contains the current
versions of article content. The second, pages-meta-current, is a more
complete archive: besides the pages-articles content, it also includes
discussion and user pages. The third, pages-meta-history, contains the
complete text of every revision of every page, which makes it very
suitable for research. We use this archive for our study. The archive
is a compressed XML file containing the complete, raw text of every
revision together with its metadata, such as page title, article text,
comment and revision timestamp. This metadata can then be used for
further research. Wikipedia organizes its content into several
categories; Table~1 lists all of them:
\begin{table}
  \centering
  \caption{Wikipedia categories}
  \begin{tabular}{|c|c|c|}
    \hline
    No. & Category name & Description \\
    \hline
    1 & Media & \\\hline
    2 & Special & \\\hline
    3 & Talk & \\\hline
    4 & User & \\\hline
    5 & User Talk & \\\hline
    6 & Wikipedia & \\\hline
    7 & Wikipedia Talk & \\\hline
    8 & Image & \\\hline
    9 & Image Talk & \\\hline
    10 & MediaWiki & \\\hline
    11 & MediaWiki Talk & \\\hline
    12 & Template & \\\hline
    13 & Template Talk & \\\hline
    14 & Help & \\\hline
  \end{tabular}
\end{table}
As Table~1 shows, content contribution is concentrated in a
limited number of categories, while the other categories serve to
organize and manage Wikipedia. Contributors write page text, discuss
what to write and how to write it in Talk, and discuss the images
accompanying the content in Image Talk. These three categories are
regarded as the places where contributors truly share their
intelligence and hard work with other people. In this study, we focus
only on these categories.
\section{Methodologies}
\label{sec:methodoligis}

We choose Chinese Wikipedia as our research target. The data archive
is a huge XML file containing every action that contributors have
performed. Each article has several, possibly a great many, revisions,
each of which records the contributor id, timestamp, action, comment
and revised article content. This information constitutes the basic
element of our study.

\subsection{Revision similarity}
\label{sec:revision-similarity}

Each page has several, sometimes hundreds of, revisions, which
constitute the edit history of the page. Every revision changes some
content of the page: better wording, more accurate description,
additional related content, and so on. In the long run, we can view the
revisions as a continuous improvement process, in which each revision
represents a small step of improvement from the previous revision to
the next. Of course, not every revision improves content quality. Since
Wikipedia is open to everyone, and both registered and anonymous users
can edit it, some revisions inevitably lower the quality of the
content, through not only unintentional mistakes but also deliberate
troublemaking. However, such damage may be erased very
soon\todo{reference needed}. We can therefore regard the content as
being at `0' at the very beginning and at `1' at the last revision of
the page: the page content evolves from `0' to `1' over several
revisions. If we define a metric for the similarity between each
earlier revision and the last revision, we obtain a series of
similarity values going from 0 towards 1, which means that after each
edit the page content is, in general, more similar to the last
revision. If we subtract the similarities of every two consecutive
revisions and call the difference $\delta$, then $\delta$ is the
contribution made by the contributor of the later revision. $\delta$
can be either positive or negative. A negative $\delta$ represents
low-quality content, a mistake that has to be fixed later, or even
vandalism. The set of $\delta$ values thus forms a rising curve that
sometimes dips (negative $\delta$) or zigzags (edit war).
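
The construction above can be written down compactly. The following is
only a sketch: the symbols $r_i$, $s_i$, $n$ and $\operatorname{sim}$
are notation introduced here for illustration, and no particular
similarity measure is assumed. Let $r_1,\dots,r_n$ be the revisions of
a page in chronological order and let
$\operatorname{sim}(\cdot,\cdot)\in[0,1]$ be a similarity metric with
$\operatorname{sim}(r_n,r_n)=1$. Then
\begin{equation*}
  s_0 = 0, \qquad
  s_i = \operatorname{sim}(r_i, r_n), \qquad
  \delta_i = s_i - s_{i-1}, \qquad i = 1,\dots,n .
\end{equation*}
Because the differences telescope,
\begin{equation*}
  \sum_{i=1}^{n} \delta_i = s_n - s_0 = 1 ,
\end{equation*}
so under these assumptions the contributions recorded on a page always
add up to 1, however many negative steps or edit wars occur along the
way.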

\subsection{Calculate similarity}
\label{sec:calculate-similarity}

Calculating the similarity between two revisions is not an easy task.
The content may not evolve linearly: an edit may not only add new
content but also adjust existing content and reorganize the content
structure. If a revision reorganizes the content structure, how should
we evaluate its contribution? Is it a great contribution because it
`rewrites' the whole content, or just a minor change because little new
content is added? In this study, we treat such an edit as a minor
change, for two reasons:
\begin{itemize}
\item The purpose of collaborating on an encyclopedia is to collect all
  relevant information and provide an objective description. Content is
  more important than its appearance. Though this kind of edit may
  organize the content more clearly, it is not appropriate to regard it
  as being as valuable as content writing.
\item We are not aware of an algorithm that can accurately calculate
  the similarity of two passages of text whose sentences are in a
  different order. It may also be meaningless to say that one passage
  is more similar to the original than another merely because the
  order of its sentences differs.
\end{itemize}

Since we can compute the contribution of each contributor to a page and
count his or her edits, we can derive two top-10 sets for each page:
the contributors with the most edits and the contributors with the
greatest contribution. By comparing these two sets we can see whether
the people who contribute most of an article's content are the same
people who edit it most often. One problem to resolve is that the
number of candidates may not always match the size of the set.
Sometimes an article does not attract many contributors, and we do not
get 10 writers. On the other hand, the number of candidates may exceed
the size of the set, and the choice of which ones to include can affect
the result for the page dramatically. Since the contribution we compute
is very precise, it is unlikely that two contributors will have the
same contribution to a page. Edit counts, however, are much more likely
to be equal. If we choose tied candidates that are also in the other
set, the ratio will be higher than normal; otherwise it will be lower.
One easy way to deal with this problem is to discard all the remaining
tied candidates, which makes the size of the set vary from one page to
another. This approach may harm the comparison we want to make. Another
consideration is that, although we know that different choices among
the remaining candidates will affect the result for a page, the extent
to which they affect the overall result is still unclear.
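
One possible way to state this comparison formally is sketched below.
The symbols $a(i)$, $c_p$, $e_p$, $E_p$, $C_p$ and $\rho_p$ are
notation introduced here for illustration, and reading the ratio as the
normalized overlap of the two sets is an assumption rather than a
definition given above. For a page $p$ with revisions $i=1,\dots,n$,
let $a(i)$ be the author of revision $i$ and $\delta_i$ the
contribution of that revision. For each contributor $u$,
\begin{equation*}
  c_p(u) = \sum_{i:\,a(i)=u} \delta_i , \qquad
  e_p(u) = \bigl|\{\, i : a(i)=u \,\}\bigr| .
\end{equation*}
If $C_p$ and $E_p$ denote the sets of (at most) 10 contributors with
the largest $c_p$ and $e_p$ respectively, the agreement between the two
rankings can be measured by
\begin{equation*}
  \rho_p = \frac{|E_p \cap C_p|}{10} .
\end{equation*}
Ties in $e_p$ at the boundary of $E_p$ are exactly the cases in which
$\rho_p$ depends on which candidates are admitted.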

\todo{simplified chinese and traditional chinese}
\todo{redirect page}
\todo{robot and user-created robot}

\subsection{Redirecting and Robot}
\label{sec:redirecting-robot}

Every page of Wikipedia is about a unique but correlated topic. It
seems inevitable that two people will sometimes create pages on the
same topic ``simultaneously'', that is, without knowing of each other's
existence. The two pages will each attract other contributors, who keep
working on them until someone accidentally finds that there are two
pages on the same topic. Redundant content in Wikipedia negatively
affects both contributors and readers. Contributors have to decide
which page to follow, and readers face potential information loss,
since they may not know that another page can provide complementary
information. Redundant content may be a more serious problem in Chinese
Wikipedia, where a topic can have several versions of its name:
traditional Chinese, simplified Chinese, the original name, a
translated name, and so on. Wikipedia provides a mechanism named
`Redirect'~\cite{wikiredirect} to solve this issue. Once a page is
marked as a redirect, anyone accessing the inferior page is
automatically directed to the better one. The inferior page retains
only its edit history, and no one adds any further content to it. The
Wikipedia data archive contains many such orphan pages, and it is
important to decide what role they play in our study. Though edited by
many contributors, these pages are regarded as incomplete and of
inferior quality compared with the pages they redirect to. Readers no
longer access their content unless they are interested in the edit
history. Their content is also covered and extended by the redirect
target, which is of better quality. We believe the orphans should not
be included in our research, and we discard them from the archive.

A robot in Wikipedia is an automatic tool, usually created by a user to
do tough and tedious work. It takes part in page editing and changes
content to conform to Wikipedia rules. Although robots appear in nearly
every page's edit history and have made millions of edits, they cannot,
as one can imagine, add substantially valuable content. They just
change the ``look'' of the page. Though a robot is seen as a real user,
its behavior is not of concern in our research. We still compute the
contribution of robots in each page, so that the contribution $\delta$
properly reflects the contribution of each user.

\subsection{Banned users}
\label{sec:banned-users}
The openness of Wikipedia allows everyone, whether registered or
not, to edit page content. It seems inevitable that some malicious
users will damage Wikipedia: arbitrarily distorting others' edits,
citing protected content without authorization, or using inappropriate
usernames to defame someone else. In general, what they do is not
collaborate with other people but ruin the platform. These users are
banned by administrators once they are detected. Since the subject of
this paper is online collaboration, we do not believe these users'
actions should be included in our study, though their edits are
recorded in the archives. Wikipedia provides lists of these
users~\cite{badnames,bannedwikipedians}, and all of their revisions are
excluded. Note that banned users are only a part of the malicious
users; excluding them does not mean that there is no vandalism in our
research data.

\subsection{Data analysis}
\label{sec:data-analysis}

After data processing we obtain the following summary data. We
collected 205378 pages from the wiki content after discarding redirect
pages (205261 of them edited by at least one registered user). There
are 55218 registered users and thousands of anonymous users, who
together contributed 4592040 revisions, averaging 22 revisions per
page. We focus only on this processed data.\todo{189426.75
  total contribution}

\subsubsection{Powerlessness of anonymous users}
\label{sec:powerl-anonym-users}
Wikipedia is an open platform that anyone can edit, even without
registering. In Chinese Wikipedia there are 907763 revisions made
by anonymous users, scattered over 96389 pages, accounting for 1.9\% of
total revisions and 46.90\% of total pages. We see that anonymous users
participate widely across various topics but are only superficially
involved: they neither contribute many revisions nor provide
high-quality content. Registered users make a contribution of 0.049 per
revision on average, while anonymous users make only 0.008. Since we
cannot identify anonymous users from their IP addresses alone, they
cannot be included in our collaboration study. As mentioned above,
anonymous users are trivial in both number of people and contribution,
so excluding them does no harm to our analysis.

\subsection{Similarity Analysis}
\label{sec:similarity-analysis}
\todo{show the intersection of edits and contribution}

\begin{table}
  \centering
  \caption{Ratio}
  \begin{tabular}{|c|c|c|}
    \hline
    Number & Count & Percentage \\\hline
    0 & 287852 & $65.71\%$ \\\hline
    1 & 108362 & $24.74\%$ \\\hline
    2 & 10105 & $2.31\%$ \\\hline
    3 & 11055 & $2.52\%$ \\\hline
    4 & 9840 & $2.25\%$ \\\hline
    5 & 6613 & $1.51\%$ \\\hline
    6 & 3033 & $0.70\%$ \\\hline
    7 & 957 & $0.22\%$ \\\hline
    8 & 197 & $0.04\%$ \\\hline
    9 & 18 & $0.004\%$ \\\hline
  \end{tabular}
\end{table}

\subsection{80-20}
\label{sec:80-20}
\todo{show whether zhwiki apply 80-20 rule}
\todo{user distribution 80-20; revisions 80-20}
Like other communities, the Wikipedia community follows the 80-20
rule. A large number of low-level users who seldom get involved in
page editing form a long tail, while a few high-level users account
for most page edits and content contribution. This paper reaches the
same conclusion, though through a different metric. We do not count
each person's edits to reveal the 80-20 effect. Our investigation of
the edit history of Chinese Wikipedia indicates that edit count is not
a good metric for the extent to which a user participates in
Wikipedia. A user may not be serious and cautious when editing a page,
and may submit, within a very short time and without careful
consideration, a series of revisions that contain only minor fixes to
the previous version. We therefore believe edit count is not accurate
enough to measure users' participation. Users' contribution, on the
other hand, is a good alternative. Contribution is confirmed page
editing: a positive contribution means that a user added content that
other users accepted, which makes it suitable for our study. Under this
metric the concentration may be more obvious than under others. The
following table shows the details:

\begin{table}[h]
  \centering
  \caption{Wikipedian groups}
  \begin{tabular}{|c|c|c|c|c|}
    \hline
    Contribution per person & $< 1$ & 1--200 & 200--900 & $> 900$ \\\hline
    Number of people & 48178 & 6817 & 125 & 30 \\\hline
    Percentage of people & $87.25\%$ & $12.35\%$ & $0.23\%$ & $0.05\%$ \\\hline
    Total pages & 59505 & 154270 & 134108 & 134615 \\\hline
    Percentage of pages & $28.97\%$ & $75.12\%$ & $65.30\%$ & $65.54\%$ \\\hline
    Total contribution & 3720.96 & 75194.85 & 48152.63 & 54919.75 \\\hline
    Percentage of contribution & $2.04\%$ & $41.32\%$ & $26.46\%$ & $30.18\%$ \\\hline
  \end{tabular}
\end{table}
Table~3 shows that the Wikipedians of Chinese Wikipedia are sharply
polarized into two groups. Most of them, $87.25\%$ according to the
table, are peripheral to the core editors. The pages they edit are
a relatively small portion of the total, and their contribution is so
trivial that it can even be ignored. At the other end of the spectrum
is a much smaller group that nevertheless provides the main content
contribution of the community. We subdivide this group into three
subgroups. The first subgroup again contains most of the users in the
group, but these users are no longer complete novices who join
Wikipedia just for fun or out of curiosity. They are genuine
contributors to Wikipedia. Unlike the `low-level' group, this subgroup
makes more than $40\%$ of the total contribution, the largest share of
any group. Without its effort, Wikipedia might not be as rich and
popular as it is. We also see a very small group of people, only 30,
who seem to have infinite energy and passion. They cover a
majority of pages and contribute nearly $1/3$ of the content of
Wikipedia. These users are the top users of the community.
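
As a rough check on this concentration, the figures in the table of
Wikipedian groups above can be combined directly. The denominator of
55218 registered users is taken from the data summary, on the
assumption that the table's percentages of people are computed against
it:
\begin{align*}
  \text{total contribution} &= 3720.96 + 75194.85 + 48152.63 + 54919.75
    = 181988.19, \\
  \text{share of the two top groups} &=
    \frac{48152.63 + 54919.75}{181988.19}
    = \frac{103072.38}{181988.19} \approx 56.6\%, \\
  \text{their share of users} &= \frac{125 + 30}{55218} \approx 0.28\%.
\end{align*}
That is, fewer than three users in a thousand account for more than
half of the measured contribution.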

\section{Behavior pattern}
\label{sec:behavior-patten}

Since we have already categorized Wikipedians into four groups, it is
interesting to study their behavior patterns separately. Though the
low-level users can be ignored without harm as far as contribution is
concerned, the size of this group suggests that their behavior may
still affect the whole community. Table~4 summarizes the
group:
\begin{table}
  \centering
  \caption{Summary of the low-level group}
  \begin{tabular}{|c|c|c|c|c|}
    \hline
    & Mean value & Median value & Max value & Min value \\\hline
    Total contribution & 0.077 & 0.008 & 0.999 & -859.501 \\\hline
    Page contribution & 0.0268 & 0.0017 & 1.640 & -7.844 \\\hline
  \end{tabular}
\end{table}
Not surprisingly, this group is made up of novices, irresponsible
authors, spoilers and the like. They never make a substantial
contribution to Wikipedia, and $25\%$ of them do not even make a
positive contribution.



Among the pages edited by this group, some users make an abnormally
high contribution to some pages. Interestingly, these
high-contribution pages are usually created by the contributors
themselves. That is, they created those pages and wrote most of
their content. Table~5 compares these self-create-and-edit pages
with the rest of the pages.
\begin{table}[h]
  \centering
  \begin{tabular}{|c|c|c|c|c|}
    \hline
    &mean value&median value&max value&min value \\\hline
    self-create-and-edit pages&
  \end{tabular}
  \caption{ttt}
\end{table}

The result indicates that this group can also be subdivided into two
subgroups. Though they share common attributes such as low total
contribution and a small number of edits, the two subgroups behave
differently. The first seems to have no preferred subject and no
incentive to build one. Its members happen upon a topic to which they
believe they have extra knowledge to add, though this is not always
true. The second pays little attention to other topics and
concentrates on a topic its members like. They always create these
pages themselves and write most of the content. In most cases the
content is short compared with that of pages created by high-level
users.

  
\subsection{New Article Creation}
\label{sec:new-article-creation}
\bibliographystyle{elsarticle-num}
\bibliography{../../bibtex/elsevier,../../bibtex/emerald,../../bibtex/chinese,../../bibtex/jstor,../../bibtex/citeseer,../../bibtex/acm,../../bibtex/wiley,../../bibtex/book,../../bibtex/thesis,../../bibtex/ebsco,../../bibtex/old,../../bibtex/ieee,../../bibtex/internet,../../bibtex/ssrn,../../bibtex/apa,../../bibtex/blackwell,../../bibtex/sage,../../bibtex/springer,../../bibtex/MESharp,../../bibtex/taylor}

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: 
